{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.sys.path.append(os.path.dirname(os.path.abspath('.')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据准备"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from datasets.dataset import load_breast_cancer\n",
    "from model_selection.train_test_split import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data=load_breast_cancer()\n",
    "X=data.data\n",
    "Y=data.target\n",
    "\n",
    "X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据$X$是一个$(n{\\times}m)$的矩阵，每一行是一个样本，每一列代表一个特征："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "n=X_train.shape[0]\n",
    "m=X_train.shape[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "标签$Y$是一个列向量，其行数与$X$相同："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y_train = Y_train.reshape((n, 1))\n",
    "Y_test = Y_test.reshape((-1, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 粗略模型\n",
    "\n",
    "模型表达式为：\n",
    "$$\n",
    "\\hat{Y}=\\sigma{(XW+b)}\n",
    "$$\n",
    "其中\n",
    "$$\n",
    "\\sigma(x)=\\frac{1}{1+e^{-x}}\n",
    "$$\n",
    "权重系数$W$的形状为$(m,1)$，偏置系数$b$为单变量系数。这里注意sigmoid函数的曲线，**只有当$XW+b$处于非常有限的范围内如$[-5,5]$时才能被sigmoid函数划分到$[0,1]$区间，而$XW+b$相当于是对原始数据的一个线性回归，想要线性回归的结果落在$[-5,5]$的范围内，需要对数据做预处理，或者缩小初始的$W$与$b$。实际发现缩小$W$的效果远优于对数据的预处理。**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sigmoid(x):\n",
    "    return 1/(1+np.exp(-x))\n",
    "\n",
    "# 缩小的初始权重参数\n",
    "W = 0.001*np.random.randn(m).reshape((m, 1))  # 权重\n",
    "b = 0  # 偏置\n",
    "\n",
    "Y_hat=sigmoid(np.dot(X_train, W)+b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模型的损失函数为：\n",
    "$$\n",
    "\\begin{aligned}\n",
    "L&=-\\sum\\limits_{i=1}^n[y^{(i)}\\ln{\\hat{y}^{(i)}}+(1-y^{(i)})\\ln{(1-\\hat{y}^{(i)})}] \\\\\n",
    "&=-\\frac{1}{n}[Y^{T}\\ln{\\hat{Y}}+(1-Y)^{T}\\ln{(1-\\hat{Y})}] \\\\\n",
    "\\end{aligned}\n",
    "$$\n",
    "损失函数关于参数$W$与$b$的梯度可以求得：\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\frac{\\partial{L}}{\\partial{W}}&=\\frac{1}{n}X^{T}{\\cdot}(\\hat{Y}-Y) \\\\\n",
    "\\frac{\\partial{L}}{\\partial{b}}&=\\frac{1}{n}{\\cdot}[1,1,...,1](\\hat{Y}-Y) \\\\\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "dW = X_train.T.dot(Y_hat - Y_train) / n\n",
    "db = np.sum(Y_hat - Y_train) / n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "参数的迭代更新公式：\n",
    "$$\n",
    "W:=W-{\\alpha}\\frac{\\partial{L}}{\\partial{W}}, \\quad b:b-{\\alpha}\\frac{\\partial{L}}{\\partial{b}}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_iter=2000\n",
    "alpha=0.00001        # 注意学习率过大会导致震荡，然后误差越来越大\n",
    "\n",
    "for i in range(max_iter):\n",
    "    Y_hat=sigmoid(np.dot(X_train, W)+b)\n",
    "    \n",
    "    dW = X_train.T.dot(Y_hat - Y_train) / n\n",
    "    db = np.sum(Y_hat - Y_train) / n\n",
    "    \n",
    "    W = W - alpha * dW\n",
    "    b = b - alpha * db"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用该模型分别对训练集与预测集做预测。**注意这里有一个实现上的坑，就是经过信号函数处理过的输出无法用于计算logistics regression的交叉熵计算，因为$ln(x)$函数不能接受0作为参数。**所以说如果要设计一个函数可以计算训练模型的交叉熵损失，必须提供模型的$W$与$b$，使用模型的原始输出概率进行计算。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "threshold = 0.5\n",
    "\n",
    "# 注意以下输出值不能用于计算交叉熵，只能用于计算准确率\n",
    "Y_pred_train = np.squeeze(\n",
    "    np.where(sigmoid(np.dot(X_train, W)+b) > threshold, 1, 0))\n",
    "Y_pred_test = np.squeeze(\n",
    "    np.where(sigmoid(np.dot(X_test, W)+b) > threshold, 1, 0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "定义一个Precision函数来评价模型的表现："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9120879120879121 0.9473684210526315\n"
     ]
    }
   ],
   "source": [
    "def ACC(Y_true,Y_pred):\n",
    "    return np.sum(Y_true==Y_pred)/len(Y_true)\n",
    "\n",
    "print(ACC(np.squeeze(Y_train),Y_pred_train),ACC(np.squeeze(Y_test),Y_pred_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模型简单打包："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6741321736913188 0.5256622014802633 0.4396315510561449 0.4419725424410019 0.3745682568891109 0.3319956988216062 0.4164478937940294 0.3679559784971766 0.33263426830286424 0.4103189660707004 "
     ]
    }
   ],
   "source": [
    "def logit_reg(X,Y,alpha=0.0001,max_iter=2000,threshold=0.5):\n",
    "    n=X.shape[0]\n",
    "    m=X.shape[1]\n",
    "    \n",
    "    W = 0.001*np.random.rand(m).reshape((m, 1))  # 权重\n",
    "    b = 0  # 偏置\n",
    "\n",
    "    for i in range(max_iter):\n",
    "        Y_hat=sigmoid(np.dot(X_train, W)+b)\n",
    "\n",
    "        dW = X.T.dot(Y_hat - Y) / n\n",
    "        db = np.sum(Y_hat - Y) / n\n",
    "\n",
    "        W = W - alpha * dW\n",
    "        b = b - alpha * db\n",
    "\n",
    "        if i%200==200-1:\n",
    "            Y_hat=sigmoid(np.dot(X_train, W)+b)\n",
    "            L=np.sum(-np.dot(Y.T,np.log(Y_hat))-np.dot((1-Y).T,np.log(1-Y_hat)))/n\n",
    "            print(L,end=' ')\n",
    "\n",
    "    return W,b\n",
    "\n",
    "W,b=logit_reg(X_train,Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据归一化\n",
    "**Normalization：**\n",
    "$$\n",
    "x=\\frac{x-x_{min}}{x_{max}-x_{min}}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "X=np.row_stack((X_train,X_test))\n",
    "\n",
    "X_max=X.max(axis=0)\n",
    "X_min=X.min(axis=0)\n",
    "\n",
    "X_train_norm=(X_train-X_min)/(X_max-X_min)\n",
    "X_test_norm=(X_test-X_min)/(X_max-X_min)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对数据归一化之后再测试模型表现："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.672021062910098 0.6623831579053243 0.6502731868055389 0.6384413361573875 0.6272882617586929 0.6167991348107899 0.6069213403740612 0.5976037249806252 0.5888000848490065 0.5804689725772003 "
     ]
    }
   ],
   "source": [
    "W,b=logit_reg(X_train_norm,Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为数据做了归一化，整个数据集上的梯度分布得到了改良，所以可以调大学习率，由此可以看出数据标准化在logistic regression上的威力："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.2178013227443486 0.17037648124448207 0.16455334108979822 0.14799782683921445 0.1398401231894581 0.13376739149466366 0.12859573235845057 0.12411286024441638 0.12018714868013629 0.1167212508622003 "
     ]
    }
   ],
   "source": [
    "W,b=logit_reg(X_train_norm,Y_train,alpha=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Standardization：**\n",
    "$$\n",
    "x=\\frac{x-x_{\\mu}}{\\sigma}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "X=np.row_stack((X_train,X_test))\n",
    "\n",
    "X_avg=X.mean(axis=0)\n",
    "X_std=X.std(axis=0)\n",
    "\n",
    "X_train_std=(X_train-X_avg)/X_std\n",
    "X_test_std=(X_test-X_avg)/X_std"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.585594948928191 10.015572317554604 15.452489526828806 20.88948901258511 26.326490511963755 31.763492076384765 37.200493643142025 42.63749520998671 48.074496776834714 53.51149834368241 "
     ]
    }
   ],
   "source": [
    "W,b=logit_reg(X_train_std,Y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.724188592470386 10.154844017876874 15.591766645603109 21.02876624157395 26.465767744252595 31.90276930878883 37.33977087555038 42.776772442395234 48.21377400924323 53.650775576090915 "
     ]
    }
   ],
   "source": [
    "W,b=logit_reg(X_train_std,Y_train,alpha=0.0001)"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
